Logistic Regression and the stock market.
May 14, 2017 By Data Scientist PakinJa
We will develop a logistic regression example. The exercise was originally published in “An Introduction to Statistical Learning. With applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Springer 2015.
The example we will develop is about predicting when the market value will rise (UP) or fall (Down).
We will carry out the exercise verbatim as published in the aforementioned reference and only with slight changes in the coding style.
For more details on the models, algorithms and parameters interpretation, it is recommended to check the aforementioned reference or any other bibliography of your choice.
“An Introduction to Statistical Learning. With applications in R” by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani. Springer 2015.
install and load required packages
library(ISLR)
library(psych)
library(knitr)
explore the dataset
Smarket
names(Smarket)
[1] "Year" "Lag1" "Lag2" "Lag3"
[5] "Lag4" "Lag5" "Volume" "Today"
[9] "Direction"
dim(Smarket)
[1] 1250 9
summary(Smarket)
Year Lag1 Lag2
Min. :2001 Min. :-4.922000 Min. :-4.922000
1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500
Median :2003 Median : 0.039000 Median : 0.039000
Mean :2003 Mean : 0.003834 Mean : 0.003919
3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750
Max. :2005 Max. : 5.733000 Max. : 5.733000
Lag3 Lag4 Lag5
Min. :-4.922000 Min. :-4.922000 Min. :-4.92200
1st Qu.:-0.640000 1st Qu.:-0.640000 1st Qu.:-0.64000
Median : 0.038500 Median : 0.038500 Median : 0.03850
Mean : 0.001716 Mean : 0.001636 Mean : 0.00561
3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.59700
Max. : 5.733000 Max. : 5.733000 Max. : 5.73300
Volume Today Direction
Min. :0.3561 Min. :-4.922000 Down:602
1st Qu.:1.2574 1st Qu.:-0.639500 Up :648
Median :1.4229 Median : 0.038500
Mean :1.4783 Mean : 0.003138
3rd Qu.:1.6417 3rd Qu.: 0.596750
Max. :3.1525 Max. : 5.733000
kable(head(Smarket))
| Year | Lag1 | Lag2 | Lag3 | Lag4 | Lag5 | Volume | Today | Direction |
|---|---|---|---|---|---|---|---|---|
| 2001 | 0.381 | -0.192 | -2.624 | -1.055 | 5.010 | 1.1913 | 0.959 | Up |
| 2001 | 0.959 | 0.381 | -0.192 | -2.624 | -1.055 | 1.2965 | 1.032 | Up |
| 2001 | 1.032 | 0.959 | 0.381 | -0.192 | -2.624 | 1.4112 | -0.623 | Down |
| 2001 | -0.623 | 1.032 | 0.959 | 0.381 | -0.192 | 1.2760 | 0.614 | Up |
| 2001 | 0.614 | -0.623 | 1.032 | 0.959 | 0.381 | 1.2057 | 0.213 | Up |
| 2001 | 0.213 | 0.614 | -0.623 | 1.032 | 0.959 | 1.3491 | 1.392 | Up |
correlation matrix
cor(Smarket[,-9])
Year Lag1 Lag2 Lag3
Year 1.00000000 0.029699649 0.030596422 0.033194581
Lag1 0.02969965 1.000000000 -0.026294328 -0.010803402
Lag2 0.03059642 -0.026294328 1.000000000 -0.025896670
Lag3 0.03319458 -0.010803402 -0.025896670 1.000000000
Lag4 0.03568872 -0.002985911 -0.010853533 -0.024051036
Lag5 0.02978799 -0.005674606 -0.003557949 -0.018808338
Volume 0.53900647 0.040909908 -0.043383215 -0.041823686
Today 0.03009523 -0.026155045 -0.010250033 -0.002447647
Lag4 Lag5 Volume Today
Year 0.035688718 0.029787995 0.53900647 0.030095229
Lag1 -0.002985911 -0.005674606 0.04090991 -0.026155045
Lag2 -0.010853533 -0.003557949 -0.04338321 -0.010250033
Lag3 -0.024051036 -0.018808338 -0.04182369 -0.002447647
Lag4 1.000000000 -0.027083641 -0.04841425 -0.006899527
Lag5 -0.027083641 1.000000000 -0.02200231 -0.034860083
Volume -0.048414246 -0.022002315 1.00000000 0.014591823
Today -0.006899527 -0.034860083 0.01459182 1.000000000
correlations between th lag variables and today returns are close to zero the only substantial correlation is between Year and Volume.
plot(Smarket$Volume, main = "Stock Market Data", ylab = "Volume", col = "blue")
scatterplots, distributions and correlations
pairs.panels(Smarket)
fit a logistic regression model to predict $Direction using $Lag1 through $Lag5 and $Volume glm(): generalized linear model function family=binomial => logistic regression
glm.fit <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
data = Smarket, family = binomial)
summary(glm.fit)
Call:
glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Smarket)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.446 -1.203 1.065 1.145 1.326
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000 0.240736 -0.523 0.601
Lag1 -0.073074 0.050167 -1.457 0.145
Lag2 -0.042301 0.050086 -0.845 0.398
Lag3 0.011085 0.049939 0.222 0.824
Lag4 0.009359 0.049974 0.187 0.851
Lag5 0.010313 0.049511 0.208 0.835
Volume 0.135441 0.158360 0.855 0.392
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1731.2 on 1249 degrees of freedom
Residual deviance: 1727.6 on 1243 degrees of freedom
AIC: 1741.6
Number of Fisher Scoring iterations: 3
the smallest p_value is associated with Lag1 the negative coefficient for this predictor suggests that if the market had a positive return yesterday, then it is less likely to go up today at a value of 0.15, the p-value is still relatively large, and so there is no clear evidence of a real association between $Lag1 and $Direction
explore fitted model coefficients
coef(glm.fit)
(Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
-0.126000257 -0.073073746 -0.042301344 0.011085108 0.009358938 0.010313068
Volume
0.135440659
summary(glm.fit)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.126000257 0.24073574 -0.5233966 0.6006983
Lag1 -0.073073746 0.05016739 -1.4565986 0.1452272
Lag2 -0.042301344 0.05008605 -0.8445733 0.3983491
Lag3 0.011085108 0.04993854 0.2219750 0.8243333
Lag4 0.009358938 0.04997413 0.1872757 0.8514445
Lag5 0.010313068 0.04951146 0.2082966 0.8349974
Volume 0.135440659 0.15835970 0.8552723 0.3924004
summary(glm.fit)$coef[ ,4]
(Intercept) Lag1 Lag2 Lag3 Lag4 Lag5 Volume
0.6006983 0.1452272 0.3983491 0.8243333 0.8514445 0.8349974 0.3924004
predict the probability that the market will go up, given values of the predictors
glm.probs <- predict(glm.fit, type = "response")
glm.probs[1:10]
1 2 3 4 5 6 7 8 9
0.5070841 0.4814679 0.4811388 0.5152224 0.5107812 0.5069565 0.4926509 0.5092292 0.5176135
10
0.4888378
contrasts(Smarket$Direction)
Up
Down 0
Up 1
These values correspond to the probability of the marketgoing up, rather than down, because the contrasts() function indicates that R has created a dummy variable with a 1 for Up.
Create a vector of class predictions based on whether the predicted probability of a market increase is greater than or less than 0.5.
glm.pred <- rep ("Down", 1250)
glm.pred[glm.probs > .5] <- "Up"
Confusion matrix in order to determine how many observations were correctly or incorrectly classified.
table(glm.pred, Smarket$Direction)
glm.pred Down Up
Down 145 141
Up 457 507
mean(glm.pred == Smarket$Direction)
[1] 0.5216
Model correctly predicted that the market would go up on 507 days and that it would go down on 145 days, for a total of 507 + 145 = 652 correct predictions. Logistic regression correctly predicted the movement of the market 52.2 % of the time.
To better assess the accuracy of the logistic regression model in this setting, we can fit the model using part of the data, and then examine how well it predicts the held out data.
train <- (Smarket$Year < 2005)
Smarket.2005 <- Smarket[!train, ]
dim(Smarket.2005)
Direction.2005 <- Smarket$Direction[!train]
Direction.2005
glm.fit <- glm(Direction~Lag1+Lag2+Lag3+Lag4+Lag5+Volume,
data = Smarket, family = binomial, subset = train)
glm.fit
Call: glm(formula = Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 +
Volume, family = binomial, data = Smarket, subset = train)
Coefficients:
(Intercept) Lag1 Lag2 Lag3 Lag4 Lag5
0.191213 -0.054178 -0.045805 0.007200 0.006441 -0.004223
Volume
-0.116257
Degrees of Freedom: 997 Total (i.e. Null); 991 Residual
Null Deviance: 1383
Residual Deviance: 1381 AIC: 1395
glm.probs <- predict(glm.fit, Smarket.2005, type = "response")
Compute the predictions for 2005 and compare them to the actual movements of the market over that time period.
glm.pred <- rep("Down", 252)
glm.pred
[1] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[13] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[25] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[37] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[49] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[61] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[73] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[85] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[97] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[109] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[121] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[133] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[145] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[157] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[169] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[181] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[193] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[205] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[217] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[229] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
[241] "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down" "Down"
glm.pred[glm.probs > 0.5] <- "Up"
table(glm.pred, Direction.2005)
Direction.2005
glm.pred Down Up
Down 77 97
Up 34 44
mean(glm.pred == Direction.2005)
[1] 0.4801587
mean(glm.pred != Direction.2005)
[1] 0.5198413
Not generally expect to be able to use previous days returns to predict future market performance.
Refit the logistic regression using just $Lag1 and $Lag2, which seemed to have the highest predictive power in the original logistic regression model.
glm.fit <- glm(Direction ~ Lag1 + Lag2 , data = Smarket,
family = binomial, subset = train)
glm.probs <- predict(glm.fit, Smarket.2005 , type = "response")
glm.pred <- rep("Down", 252)
glm.pred[glm.probs > 0.5] <- "Up"
table(glm.pred, Direction.2005)
Direction.2005
glm.pred Down Up
Down 35 35
Up 76 106
mean(glm.pred == Direction.2005)
[1] 0.5595238
Results appear to be a little better: 56% If we want to predict the returns associated with particular values of $Lag1 and $Lag2
predict(glm.fit, newdata = data.frame(Lag1 = c (1.2 ,1.5),
Lag2 = c(1.1, -0.8)) , type = "response")
1 2
0.4791462 0.4960939
library(psych)
Read and explore the data
insurance <- read.csv("insurance.csv", header = T)
head(insurance)
str(insurance)
'data.frame': 1338 obs. of 7 variables:
$ age : int 19 18 28 33 32 31 46 37 37 60 ...
$ sex : Factor w/ 2 levels "female","male": 1 2 2 2 2 1 1 1 2 1 ...
$ bmi : num 27.9 33.8 33 22.7 28.9 ...
$ children: int 0 1 3 0 0 0 1 3 2 0 ...
$ smoker : Factor w/ 2 levels "no","yes": 2 1 1 1 1 1 1 1 1 1 ...
$ region : Factor w/ 4 levels "northeast","northwest",..: 4 3 3 2 2 3 3 2 1 2 ...
$ charges : num 16885 1726 4449 21984 3867 ...
Model dependent variable: $expenses
### change $charges name to $expenses
colnames(insurance)[7] <- "expenses"
summary(insurance$expenses)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1122 4740 9382 13270 16640 63770
hist(insurance$expenses, main = "Insurance Expenses", col = "red",
xlab = "Expenses (USD")
### explore $region
table(insurance$region)
northeast northwest southeast southwest
324 325 364 325
Exploring relationships among features
### correlation matrix
cor(insurance[c("age", "bmi", "children", "expenses")])
age bmi children
age 1.0000000 0.1092719 0.04246900
bmi 0.1092719 1.0000000 0.01275890
children 0.0424690 0.0127589 1.00000000
expenses 0.2990082 0.1983410 0.06799823
expenses
age 0.29900819
bmi 0.19834097
children 0.06799823
expenses 1.00000000
Visualizing relationships among features.
### scatterplot matrix
pairs(insurance[c("age", "bmi", "children", "expenses")])
### scatterplots, distributions and correlations
pairs.panels(insurance[c("age", "bmi", "children", "expenses")])
ins_model
Call:
lm(formula = expenses ~ ., data = insurance)
Coefficients:
(Intercept) age
-11938.5 256.9
sexmale bmi
-131.3 339.2
children smokeryes
475.5 23848.5
regionnorthwest regionsoutheast
-353.0 -1035.0
regionsouthwest
-960.1
### evaluating model performance
summary(ins_model)
Call:
lm(formula = expenses ~ ., data = insurance)
Residuals:
Min 1Q Median 3Q Max
-11304.9 -2848.1 -982.1 1393.9 29992.8
Coefficients:
Estimate Std. Error t value
(Intercept) -11938.5 987.8 -12.086
age 256.9 11.9 21.587
sexmale -131.3 332.9 -0.394
bmi 339.2 28.6 11.860
children 475.5 137.8 3.451
smokeryes 23848.5 413.1 57.723
regionnorthwest -353.0 476.3 -0.741
regionsoutheast -1035.0 478.7 -2.162
regionsouthwest -960.0 477.9 -2.009
Pr(>|t|)
(Intercept) < 2e-16 ***
age < 2e-16 ***
sexmale 0.693348
bmi < 2e-16 ***
children 0.000577 ***
smokeryes < 2e-16 ***
regionnorthwest 0.458769
regionsoutheast 0.030782 *
regionsouthwest 0.044765 *
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16
The model explains 74.9% of the variation of the dependent variable (adjusted R-squared: 0.7494).
Improving model performance
### adding non-linear relationships
### adding second order term on $age
insurance$age2 <- insurance$age^2
Converting a numeric variable to a binary indicator
insurance$bmi30 <- ifelse(insurance$bmi >= 30, 1, 0)
Putting it all together
### improved regression model
ins_model2 <- lm(expenses ~ age + age2 + children + bmi + sex + bmi30*smoker + region, data = insurance)
summary(ins_model2)
Call:
lm(formula = expenses ~ age + age2 + children + bmi + sex + bmi30 *
smoker + region, data = insurance)
Residuals:
Min 1Q Median 3Q Max
-17296.4 -1656.0 -1263.3 -722.1 24160.2
Coefficients:
Estimate Std. Error t value
(Intercept) 134.2509 1362.7511 0.099
age -32.6851 59.8242 -0.546
age2 3.7316 0.7463 5.000
children 678.5612 105.8831 6.409
bmi 120.0196 34.2660 3.503
sexmale -496.8245 244.3659 -2.033
bmi30 -1000.1403 422.8402 -2.365
smokeryes 13404.6866 439.9491 30.469
regionnorthwest -279.2038 349.2746 -0.799
regionsoutheast -828.5467 351.6352 -2.356
regionsouthwest -1222.6437 350.5285 -3.488
bmi30:smokeryes 19810.7533 604.6567 32.764
Pr(>|t|)
(Intercept) 0.921539
age 0.584915
age2 6.50e-07 ***
children 2.04e-10 ***
bmi 0.000476 ***
sexmale 0.042240 *
bmi30 0.018159 *
smokeryes < 2e-16 ***
regionnorthwest 0.424212
regionsoutheast 0.018604 *
regionsouthwest 0.000503 ***
bmi30:smokeryes < 2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4445 on 1326 degrees of freedom
Multiple R-squared: 0.8664, Adjusted R-squared: 0.8653
F-statistic: 781.7 on 11 and 1326 DF, p-value: < 2.2e-16
The accuracy of the model has improved to an 86.5% of explanation of the variation of the independent variable.
The iris dataset contains data about sepal length, sepal width, petal length, and petal width of flowers of different species. Let us see what it looks like:
library(datasets)
head(iris)
After a little bit of exploration, I found that Petal.Length and Petal.Width were similar among the same species but varied considerably between different species, as demonstrated below:
library(ggplot2)
Find out what's changed in ggplot2 at
http://github.com/tidyverse/ggplot2/releases.
Attaching package: 㤼㸱ggplot2㤼㸲
The following objects are masked from 㤼㸱package:psych㤼㸲:
%+%, alpha
ggplot(iris, aes(Petal.Length, Petal.Width, color = Species)) + geom_point()
Clustering
Okay, now that we have seen the data, let us try to cluster it. Since the initial cluster assignments are random, let us set the seed to ensure reproducibility.
set.seed(20)
irisCluster <- kmeans(iris[, 3:4], 3, nstart = 20)
irisCluster
K-means clustering with 3 clusters of sizes 50, 52, 48
Cluster means:
Petal.Length Petal.Width
1 1.462000 0.246000
2 4.269231 1.342308
3 5.595833 2.037500
Clustering vector:
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[23] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
[45] 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[67] 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 3 2 2 2 2
[89] 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 2 3 3 3
[111] 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 2 3 3 3 3 3
[133] 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3
Within cluster sum of squares by cluster:
[1] 2.02200 13.05769 16.29167
(between_SS / total_SS = 94.3 %)
Available components:
[1] "cluster" "centers" "totss"
[4] "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
Since we know that there are 3 species involved, we ask the algorithm to group the data into 3 clusters, and since the starting assignments are random, we specify nstart = 20. This means that R will try 20 different random starting assignments and then select the one with the lowest within cluster variation. We can see the cluster centroids, the clusters that each data point was assigned to, and the within cluster variation.
Let us compare the clusters with the species.
table(irisCluster$cluster, iris$Species)
setosa versicolor virginica
1 50 0 0
2 0 48 4
3 0 2 46
As we can see, the data belonging to the setosa species got grouped into cluster 3, versicolor into cluster 2, and virginica into cluster 1. The algorithm wrongly classified two data points belonging to versicolor and six data points belonging to virginica.
We can also plot the data to see the clusters:
irisCluster$cluster <- as.factor(irisCluster$cluster)
ggplot(iris, aes(Petal.Length, Petal.Width, color = irisCluster$cluster)) + geom_point()
That brings us to the end of the article. I hope you enjoyed it! If you have any questions or feedback, feel free to leave a comment or reach out to me on Twitter.
bigMart <- read.csv("BigMartData.csv", header = T,stringsAsFactors = F)
bigMart
str(bigMart)
'data.frame': 8523 obs. of 12 variables:
$ Item_Identifier : chr "FDA15" "DRC01" "FDN15" "FDX07" ...
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : chr "Low Fat" "Regular" "Low Fat" "Regular" ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
$ Item_Type : chr "Dairy" "Soft Drinks" "Meat" "Fruits and Vegetables" ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Identifier : chr "OUT049" "OUT018" "OUT049" "OUT010" ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
$ Outlet_Size : chr "Medium" "Medium" "Medium" "" ...
$ Outlet_Location_Type : chr "Tier 1" "Tier 3" "Tier 1" "Tier 3" ...
$ Outlet_Type : chr "Supermarket Type1" "Supermarket Type2" "Supermarket Type1" "Grocery Store" ...
$ Item_Outlet_Sales : num 3735 443 2097 732 995 ...
ggplot(bigMart, aes(Item_Visibility, Item_MRP, group = Item_Type, color = Item_Type)) + geom_point() + scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+ scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+ theme_bw()
Now, we can view a third variable also in same chart, say a categorical variable (Item_Type) which will give the characteristic (item_type) of each data set. Different categories are depicted by way of different color for item_type in below chart.
We can even make it more visually clear by creating separate scatter plots for each separate Item_Type as shown below.
ggplot(bigMart, aes(Item_Visibility, Item_MRP)) + geom_point(aes(color = Item_Type)) +
scale_x_continuous("Item Visibility", breaks = seq(0,0.35,0.05))+
scale_y_continuous("Item MRP", breaks = seq(0,270,by = 30))+
theme_bw() + labs(title="Scatterplot") + facet_wrap(~ Item_Type)
When to use: Histogram is used to plot continuous variable. It breaks the data into bins and shows frequency distribution of these bins. We can always change the bin size and see the effect it has on visualization.
From our mart dataset, if we want to know the count of items on basis of their cost, then we can plot histogram using continuous variable Item_MRP as shown below.
ggplot(bigMart, aes(Item_MRP)) + geom_histogram(binwidth = 2)+
scale_x_continuous("Item MRP", breaks = seq(0,270,by = 30))+
scale_y_continuous("Count", breaks = seq(0,200,by = 20))+
labs(title = "Histogram")
When to use: Bar charts are recommended when you want to plot a categorical variable or a combination of continuous and categorical variable.
From our dataset, if we want to know number of marts established in particular year, then bar chart would be most suitable option, use variable Establishment Year as shown below.
ggplot(bigMart, aes(Outlet_Establishment_Year)) + geom_bar(fill = "red")+theme_bw()+
scale_x_continuous("Establishment Year", breaks = seq(1985,2010)) +
scale_y_continuous("Count", breaks = seq(0,1500,150)) +
coord_flip()+ labs(title = "Bar Chart") + theme_gray()
ggplot(bigMart, aes(Item_Type, Item_Weight)) + geom_bar(stat = "identity", fill = "darkblue") + scale_x_discrete("Outlet Type")+ scale_y_continuous("Item Weight", breaks = seq(0,15000, by = 500))+ theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + labs(title = "Bar Chart")
Stacked Bar
ggplot(bigMart, aes(Outlet_Location_Type, fill = Outlet_Type)) + geom_bar()+
labs(title = "Stacked Bar Chart", x = "Outlet Location Type", y = "Count of Outlets")
From our dataset, if we want to identify each outlet’s detailed item sales including minimum, maximum & median numbers, box plot can be helpful. In addition, it also gives values of outliers of item sales for each outlet as shown in below chart.
The black points are outliers. Outlier detection and removal is an essential step of successful data exploration.
ggplot(bigMart, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_boxplot(fill = "red")+
scale_y_continuous("Item Outlet Sales", breaks= seq(0,15000, by=500))+
labs(title = "Box Plot", x = "Outlet Identifier")
From our dataset, when we want to analyze the trend of item outlet sales, area chart can be plotted as shown below. It shows count of outlets on basis of sales.
ggplot(bigMart, aes(Item_Outlet_Sales)) + geom_area(stat = "bin", bins = 30, fill = "steelblue") + scale_x_continuous(breaks = seq(0, 11000, 1000)) + labs(title = "Area Chart", x = "Item Outlet Sales", y = "Number of Outlets")
From our dataset, if we want to know cost of each item on every outlet, we can plot heatmap as shown below using three variables Item MRP, Outlet Identifier & Item Type from our mart dataset.
ggplot(bigMart, aes(Outlet_Identifier, Item_Type)) + geom_raster(aes(fill = Item_MRP)) + labs(title = "Heat Map", x = "Outlet Identifier", y = "Item Type") + scale_fill_continuous(name = "Item MRP")
Darker the color, higher the co-relation between variables. Positive co-relations are displayed in blue and negative correlations in red color. Color intensity is proportional to the co-relation value.
From our dataset, let’s check co-relation between Item cost, weight, visibility along with Outlet establishment year and Outlet sales from below plot.
In our example, we can see that Item cost & Outlet sales are positively correlated while Item weight & its visibility are negatively correlated.
library(corrgram)
corrgram(bigMart, order = NULL, panel = panel.shade, text.panel = panel.txt, main = "Correlogram")
head(M)
mpg cyl disp hp drat
mpg 1.0000000 -0.8521620 -0.8475514 -0.7761684 0.6811719
cyl -0.8521620 1.0000000 0.9020329 0.8324475 -0.6999381
disp -0.8475514 0.9020329 1.0000000 0.7909486 -0.7102139
hp -0.7761684 0.8324475 0.7909486 1.0000000 -0.4487591
drat 0.6811719 -0.6999381 -0.7102139 -0.4487591 1.0000000
wt -0.8676594 0.7824958 0.8879799 0.6587479 -0.7124406
wt qsec vs am gear
mpg -0.8676594 0.41868403 0.6640389 0.5998324 0.4802848
cyl 0.7824958 -0.59124207 -0.8108118 -0.5226070 -0.4926866
disp 0.8879799 -0.43369788 -0.7104159 -0.5912270 -0.5555692
hp 0.6587479 -0.70822339 -0.7230967 -0.2432043 -0.1257043
drat -0.7124406 0.09120476 0.4402785 0.7127111 0.6996101
wt 1.0000000 -0.17471588 -0.5549157 -0.6924953 -0.5832870
carb
mpg -0.5509251
cyl 0.5269883
disp 0.3949769
hp 0.7498125
drat -0.0907898
wt 0.4276059
# mat : is a matrix of data
# ... : further arguments to pass to the native R cor.test function
cor.mtest <- function(mat, ...) {
mat <- as.matrix(mat)
n <- ncol(mat)
p.mat<- matrix(NA, n, n)
diag(p.mat) <- 0
for (i in 1:(n - 1)) {
for (j in (i + 1):n) {
tmp <- cor.test(mat[, i], mat[, j], ...)
p.mat[i, j] <- p.mat[j, i] <- tmp$p.value
}
}
colnames(p.mat) <- rownames(p.mat) <- colnames(mat)
p.mat
}
# matrix of the p-value of the correlation
p.mat <- cor.mtest(mtcars)
head(p.mat[, 1:5])
mpg cyl disp hp
mpg 0.000000e+00 6.112687e-10 9.380327e-10 1.787835e-07
cyl 6.112687e-10 0.000000e+00 1.802838e-12 3.477861e-09
disp 9.380327e-10 1.802838e-12 0.000000e+00 7.142679e-08
hp 1.787835e-07 3.477861e-09 7.142679e-08 0.000000e+00
drat 1.776240e-05 8.244636e-06 5.282022e-06 9.988772e-03
wt 1.293959e-10 1.217567e-07 1.222320e-11 4.145827e-05
drat
mpg 1.776240e-05
cyl 8.244636e-06
disp 5.282022e-06
hp 9.988772e-03
drat 0.000000e+00
wt 4.784260e-06
Add significance level to the correlogram
# Specialized the insignificant value according to the significant level
corrplot(M, type="upper", order="hclust",
p.mat = p.mat, sig.level = 0.01)
# Leave blank on no significant coefficient
corrplot(M, type="upper", order="hclust",
p.mat = p.mat, sig.level = 0.01, insig = "blank")
bigMart, aes(, Item_Outlet_Sales